In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the following questions in this notebook and submit it to your GitHub account.
You can include written notes in notebook cells using Markdown:
In [ ]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/human_body_temperature.csv')
In [ ]:
df.info()
df.head()
We first start by viewing the histogram of the human body temperatures.
In [ ]:
# Plots the histogram of temperatures
import matplotlib.pyplot as plt
import seaborn as sns
temperature = df['temperature']
sns.set()
# density=True replaces the removed normed=True argument and normalizes the bar areas to 1
plt.hist(temperature, bins='auto', density=True)
plt.xlabel('Temperature(F)')
plt.ylabel('Density')
plt.title('Human Body Temperature')
plt.show()
It is difficult to judge from this histogram alone whether the data is normally distributed. A better visual check is to compare the empirical CDF (ECDF) of the temperature data with the CDF of a normal distribution that has the same mean and standard deviation.
In [ ]:
# Plots the ECDF and CDF of the human body temperatures
def ecdf(data):
    """
    Compute ECDF for a one-dimensional array of measurements.
    Returns tuple of arrays (x,y) that contain x and y values for ECDF.
    """
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
x_ecdf, y_ecdf = ecdf(temperature)
temperature_theoretical = np.random.normal(np.mean(temperature), np.std(temperature), size=10000)
x_theoretical_cdf, y_theoretical_cdf = ecdf(temperature_theoretical)
plt.plot(x_ecdf, y_ecdf, marker='.', linestyle='none')
plt.plot(x_theoretical_cdf, y_theoretical_cdf)
plt.xlabel('Temperature(F)')
plt.ylabel('CDF')
plt.title('Human Body Temperature')
# Legend labels follow plotting order: ECDF markers first, then the theoretical CDF line
plt.legend(('ECDF', 'CDF'), loc='lower right')
plt.show()
The ECDF and CDF on the graph above seem to align, implying that the temperature data is likely normally distributed. We can perform a normality test to double-check.
In [ ]:
# Performs normal test
import scipy.stats as stats
def isNormal(data):
    # Null hypothesis of the test: the data come from a normal distribution
    z, p = stats.mstats.normaltest(data)
    if p < 0.05:
        print('The data is more likely NOT normally distributed')
    else:
        print('The data is more likely normally distributed')
isNormal(temperature)
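As a complement to the visual comparison, the fitted normal CDF can be evaluated exactly rather than approximated from 10,000 random draws. The sketch below is only illustrative: it uses synthetic values as a stand-in for the temperature column (an assumption, not the real data) and reports the largest vertical gap between the ECDF and the fitted normal CDF.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the temperature column (an assumption, not the real data)
rng = np.random.default_rng(0)
temps = rng.normal(98.25, 0.73, size=130)

def ecdf(data):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x_ecdf, y_ecdf = ecdf(temps)

# Evaluate the fitted normal CDF exactly on a grid instead of sampling from it
grid = np.linspace(temps.min(), temps.max(), 200)
cdf_exact = stats.norm.cdf(grid, loc=np.mean(temps), scale=np.std(temps))

# Largest vertical gap between the ECDF and the fitted normal CDF at the data points
fitted_at_data = stats.norm.cdf(x_ecdf, loc=np.mean(temps), scale=np.std(temps))
max_gap = np.max(np.abs(y_ecdf - fitted_at_data))
print('largest ECDF-to-CDF gap:', max_gap)
```

A small gap is consistent with normality, but it is a descriptive check only and does not replace the formal test above.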
The rule of thumb for the Central Limit Theorem is that a sample size of 30 or more is considered a large sample size. The sample size is large since we have a sample size of 130. The observations are also independent.
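The rule of thumb can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes a strongly skewed exponential population, chosen only for illustration, and compares the skewness of raw draws with the skewness of means of samples of size 30.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A strongly right-skewed population (exponential), chosen only for illustration
raw = rng.exponential(scale=1.0, size=5000)

# 5,000 sample means, each computed from a sample of size 30
sample_means = rng.exponential(scale=1.0, size=(5000, 30)).mean(axis=1)

skew_raw = stats.skew(raw)
skew_means = stats.skew(sample_means)
print('skewness of raw draws:', skew_raw)
print('skewness of sample means (n=30):', skew_means)
```

The sample means should be far less skewed than the raw draws, which is the behavior the Central Limit Theorem describes.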
A one-sample test should be used here because we have only one set of data, which we will compare to a single mean. It is appropriate to use the $z$ statistic in this case because the sample size is 30 or greater. The $t$ statistic should be used if the sample size is less than 30.
In [ ]:
df.describe()
We will now perform a bootstrap hypothesis test with the following:
$H_0$: The true mean body temperature is equal to the accepted value of 98.6. $\mu=\mu_0$
$H_A$: The means are different. $\mu\neq\mu_0$
In [ ]:
# Calculates p value using 100,000 bootstrap replicates
bootstrap_replicates = np.empty(100000)
size = len(bootstrap_replicates)
for i in range(size):
    # Resample the data with replacement and record the mean of the resample
    bootstrap_sample = np.random.choice(temperature, size=len(temperature))
    bootstrap_replicates[i] = np.mean(bootstrap_sample)
# Fraction of bootstrap means that reach the hypothesized value of 98.6
p = np.sum(bootstrap_replicates >= 98.6) / len(bootstrap_replicates)
print('p =', p)
The p value is extremely small after 100,000 replicates. This implies that the true mean is different from 98.6 degrees F.
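A common variant of this test first shifts the data so that its mean equals the hypothesized value, which simulates the null hypothesis directly and gives a two-sided p value that matches $H_A$. The sketch below is a hedged illustration and not the notebook's own method: the data are synthetic stand-ins (an assumption), and the names `shifted` and `p_two_sided` are introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the temperature column (an assumption, not the real data)
temps = rng.normal(98.25, 0.73, size=130)

null_mean = 98.6
observed_diff = np.mean(temps) - null_mean

# Shift the data so that its mean equals the null value, keeping its spread
shifted = temps - np.mean(temps) + null_mean

n_reps = 10000
# Draw all resamples at once: each row is one bootstrap sample
resamples = rng.choice(shifted, size=(n_reps, len(shifted)), replace=True)
replicate_diffs = resamples.mean(axis=1) - null_mean

# Two-sided p value: how often a null replicate is at least as extreme as observed
p_two_sided = np.mean(np.abs(replicate_diffs) >= abs(observed_diff))
print('two-sided bootstrap p =', p_two_sided)
```

With a finite number of replicates, a reported p value of 0 only means that it is smaller than roughly one over the number of replicates.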
We can repeat the hypothesis test by calculating the z-score to verify our results above.
$z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $\mu_0 = 98.6$ is the hypothesized population mean, and $\sigma / \sqrt{n}$ is the standard error of the mean.
Since we do not know the population's standard deviation $\sigma$, we approximate it with the sample standard deviation $s$, so the standard error becomes:
$\sigma / \sqrt{n} \approx s / \sqrt{n}$
In [ ]:
# Calculates z and p values and performs z test
z = (np.mean(temperature) - 98.6) / (np.std(temperature) / np.sqrt(len(temperature)))
print('z =', z)
p_z = stats.norm.sf(abs(z))*2
print('p = p(z >= 5.476) + p(z <= -5.476) =', p_z)
The p value is extremely small, which confirms that the true mean is likely different from 98.6. We will compare the results with the $t$ statistic. The $t$ and $z$ values should be approximately the same.
In [ ]:
# Performs t test
t = z
print('t =', t)
p_t = stats.t.sf(np.abs(t), len(temperature)-1)*2
print('p = p(t >= 5.476) + p(t <= -5.476) =', p_t)
The p value from the $t$ test is different, but it still leads us to reject the null hypothesis.
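As a cross-check, SciPy's `stats.ttest_1samp` runs the one-sample $t$ test in a single call. The sketch below uses synthetic stand-in data (an assumption). Note that `ttest_1samp` uses the sample standard deviation with `ddof=1`, whereas `np.std` defaults to `ddof=0`, so its statistic can differ slightly from a value computed with the default.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-in for the temperature column (an assumption, not the real data)
temps = rng.normal(98.25, 0.73, size=130)

# Library one-sample t test against the hypothesized mean of 98.6
t_stat, p_value = stats.ttest_1samp(temps, popmean=98.6)

# Manual computation for comparison; ddof=1 gives the sample standard deviation
standard_error = np.std(temps, ddof=1) / np.sqrt(len(temps))
t_manual = (np.mean(temps) - 98.6) / standard_error
p_manual = stats.t.sf(np.abs(t_manual), len(temps) - 1) * 2

print('t (scipy) =', t_stat, ' t (manual) =', t_manual)
print('p (scipy) =', p_value, ' p (manual) =', p_manual)
```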
Since we will be drawing a random sample of size 10, the $t$ statistic will be more appropriate to use.
In [ ]:
# Draws random sample of 10
sample = np.random.choice(temperature, size=10)
sample
In [ ]:
# Performs t test
t2 = (np.mean(sample) - 98.6) / (np.std(sample) / np.sqrt(len(sample)))
print('t =', t2)
# Use t2 (the statistic of the 10-point sample), not t from the full data set
p_t2 = stats.t.sf(np.abs(t2), len(sample)-1)*2
print('p = ', p_t2)
In [ ]:
# Performs z test
z2 = (np.mean(sample) - 98.6) / (np.std(sample) / np.sqrt(len(sample)))
print('z =', z2)
# Use z2 (the statistic of the 10-point sample), not z from the full data set
p_z2 = stats.norm.sf(abs(z2))*2
print('p =', p_z2)
The p values for the t and z tests are significantly different. This shows that if you apply the wrong test to a problem you can end up with an incorrect result. It is important to know when it is appropriate to apply the $z$ statistic and the $t$ statistic. When the sample size is less than 30, the $t$ statistic should be used.
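The gap between the two tests comes from the heavier tails of the $t$ distribution at small sample sizes. The sketch below compares the two-sided 95% critical values for a sample of size 10, and the p values each distribution assigns to the same statistic. The value 2.0 is an arbitrary illustrative statistic, not one taken from the data.

```python
from scipy import stats

n = 10  # sample size used in this part of the exercise

# Two-sided 95% critical values: about 1.96 for z and about 2.26 for t with 9 degrees of freedom
z_crit = stats.norm.ppf(0.975)
t_crit = stats.t.ppf(0.975, df=n - 1)
print('z critical value:', z_crit)
print('t critical value:', t_crit)

# The same test statistic gives a larger p value under the t distribution
statistic = 2.0  # arbitrary illustrative value
p_from_z = 2 * stats.norm.sf(statistic)
p_from_t = 2 * stats.t.sf(statistic, df=n - 1)
print('p from z:', p_from_z)
print('p from t:', p_from_t)
```

Because the $t$ critical value is larger, a $z$ test applied to a small sample overstates the evidence against the null hypothesis.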
In [ ]:
# Calculates margin of error for sample mean with 95% confidence
print('The mean temperature of the data is', np.mean(temperature))
z_critical = 1.96 # this is the value of z for 95% confidence; named so it does not overwrite the z statistic above
error = z_critical * np.std(temperature) / np.sqrt(len(temperature))
print('margin of error for a sample mean =', error)
The mean temperature of all humans is estimated with 95% confidence to be 98.25 +/- 0.126, or between 98.124 and 98.376 degrees Fahrenheit. If we define an "abnormal" temperature as one outside this confidence interval for the mean, this would include all temperatures greater than 98.376 or less than 98.124.
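The margin-of-error calculation can be wrapped in a small helper that looks up the critical value instead of hard-coding 1.96. This is a sketch under stated assumptions: the function name `mean_confidence_interval` is hypothetical, the data are synthetic stand-ins, and the helper uses the $t$ critical value with `ddof=1`, so its result can differ slightly from the $z$-based interval above.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(data, confidence=0.95):
    """Return (lower, upper) bounds of a t-based confidence interval for the mean."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    mean = data.mean()
    standard_error = data.std(ddof=1) / np.sqrt(n)
    # Two-sided critical value for the requested confidence level
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    margin = t_crit * standard_error
    return mean - margin, mean + margin

rng = np.random.default_rng(3)
# Synthetic stand-in for the temperature column (an assumption, not the real data)
temps = rng.normal(98.25, 0.73, size=130)

lower, upper = mean_confidence_interval(temps)
print('95% confidence interval for the mean:', lower, 'to', upper)
```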
In [ ]:
# Calculates the range that contains the central 95% of the observed temperatures
central_range = np.percentile(temperature, [2.5, 97.5])
print('We expect 95% of the temperature data to be between', central_range[0], 'and', central_range[1])
If we instead define an "abnormal" temperature to be outside the central 95% of the observed data, this would include temperatures greater than 99.478 or less than 96.723.
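The two definitions of "abnormal" differ because an interval for the mean is much narrower than a range that covers most individual readings. The sketch below contrasts them on synthetic stand-in data (an assumption), using a bootstrap percentile interval for the mean. The names `data_range` and `mean_interval` are introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic stand-in for the temperature column (an assumption, not the real data)
temps = rng.normal(98.25, 0.73, size=130)

# Range covering the central 95% of individual observations
data_range = np.percentile(temps, [2.5, 97.5])

# Bootstrap percentile interval for the mean: resample, average, take percentiles
resamples = rng.choice(temps, size=(10000, len(temps)), replace=True)
mean_interval = np.percentile(resamples.mean(axis=1), [2.5, 97.5])

print('central 95% of observations:', data_range)
print('95% bootstrap interval for the mean:', mean_interval)
```

The range of individual observations is the more natural basis for judging a single reading, while the interval for the mean describes uncertainty about the population average.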
A two-sample permutation test on the difference in means is appropriate for this problem, because we will be testing whether males and females have the same distribution and similar mean temperatures. First we should visualize the data with exploratory data analysis.
In [ ]:
# Plots the ECDF for the temperatures of males and females
male_temperature = df[df['gender'] == 'M']['temperature']
female_temperature = df[df['gender'] == 'F']['temperature']
x_male, y_male = ecdf(male_temperature)
x_female, y_female = ecdf(female_temperature)
plt.plot(x_male, y_male, marker='.', linestyle='none', color='red')
# Second series is the female ECDF (the original plotted the male data twice)
plt.plot(x_female, y_female, marker='.', linestyle='none', color='blue')
plt.xlabel('Temperature(F)')
plt.ylabel('ECDF')
plt.legend(('Male', 'Female'), loc='lower right')
plt.title('Male vs Female: Human Body Temperature')
plt.show()
male_and_female_diff = np.abs(np.mean(male_temperature) - np.mean(female_temperature))
print('The difference between the male and female mean temperatures is', male_and_female_diff)
We can see that the male and female ECDF graphs largely overlap, which tells us that the difference between the two data sets is small to begin with (0.289). We can now continue with hypothesis testing to see whether this difference is due to gender or to chance.
$H_0$: There is no difference in the distribution and means of males and females.
$H_A$: There is a difference in the distribution and means of males and females.
In [ ]:
permutation_replicates = np.empty(100000)
size = len(permutation_replicates)
for i in range(size):
    # Shuffle the pooled temperatures, then split them into two groups of the original sizes
    combined_perm_temperatures = np.random.permutation(np.concatenate((male_temperature, female_temperature)))
    male_permutation = combined_perm_temperatures[:len(male_temperature)]
    female_permutation = combined_perm_temperatures[len(male_temperature):]
    permutation_replicates[i] = np.abs(np.mean(male_permutation) - np.mean(female_permutation))
# Fraction of permutations with a difference at least as large as the observed one
p_val = np.sum(permutation_replicates >= male_and_female_diff) / len(permutation_replicates)
print('p =', p_val)
The p value is less than 0.05, which shows that the difference in the means of male and female temperatures is statistically significant. We can reject the null hypothesis ($H_0$).
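The permutation procedure can be packaged as a reusable function. The sketch below is a minimal version under stated assumptions: the function name `permutation_p_value` and its parameters are hypothetical, and the two groups in the usage example are synthetic rather than the male and female temperatures.

```python
import numpy as np

def permutation_p_value(a, b, n_reps=10000, seed=None):
    """Two-sided permutation p value for the absolute difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    combined = np.concatenate((a, b))
    count = 0
    for _ in range(n_reps):
        # Shuffle the pooled data and split it into groups of the original sizes
        shuffled = rng.permutation(combined)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            count += 1
    return count / n_reps

rng = np.random.default_rng(5)
# Synthetic groups with clearly different means, for illustration only
group_a = rng.normal(0.0, 1.0, size=50)
group_b = rng.normal(3.0, 1.0, size=50)
p_different = permutation_p_value(group_a, group_b, n_reps=2000, seed=0)
print('p for clearly different groups =', p_different)
```

Passing a `seed` makes the result reproducible, which is useful when a p value sits close to the significance threshold.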
The mean normal body temperature was held to be 37$^∘$ C or 98.6$^∘$ F for more than 120 years after it was first conceptualized and reported by Carl Wunderlich in a famous 1868 book. However, this dataset does not support that value: the mean normal body temperature was estimated with 95% confidence to be between 98.124 and 98.376 degrees F. There is also a statistically significant difference between the mean temperatures of males and females.